Skip to content

Agentic builds: dsynth evidence capture hooks#1517

Open
tuxillo wants to merge 93 commits into
masterfrom
agentic-dsynth-evidence-hooks
Open

Agentic builds: dsynth evidence capture hooks#1517
tuxillo wants to merge 93 commits into
masterfrom
agentic-dsynth-evidence-hooks

Conversation

@tuxillo
Copy link
Copy Markdown
Member

@tuxillo tuxillo commented Jan 10, 2026

Goal

We are designing a system to automatically (agent-assisted) fix ports while keeping the existing, build-driven workflow intact:

  • dsynth stays the authoritative build executor.
  • On failure, we capture a bounded evidence bundle (distilled errors + small port context) so automated triage/patch generation can be driven by real build output without dumping huge logs or entire work directories into an AI context.
  • Evidence is intended to flow into an asynchronous agent pipeline (triage → patch → review) via a central queue (documented), so builds never block on AI availability.

What this PR adds (foundation)

  • dsynth hook scripts under scripts/dsynth-hooks/:
    • hook_run_start / hook_run_end group failures per build run and snapshot dsynth summary lists.
    • hook_pkg_failure creates a per-failure evidence bundle with:
      • logs/errors.txt (high-signal extract, capped at 200KB)
      • logs/full.log.gz (full log preserved for humans)
      • port/* snapshot (Makefile/distinfo/pkg-plist/patches, etc.)
      • meta.txt and basic dsynth profile/config snapshots
  • Design/usage documentation in docs/AGENTIC_BUILDS.md describing:
    • the overall automated-fixing workflow (bounded evidence → triage → snippet escalation → patch → rebuild)
    • an opencode integration plan, including a central queue model for asynchronous triage
  • A small README pointer to the hook location.

What this PR does not do (yet)

  • No network calls from hooks.
  • No queue writer/runner implementation.
  • No automated patch application.

Those are intentionally deferred so this PR can land the core evidence-capture mechanism safely and independently.

How to try it

  1. Install hooks by copying/symlinking scripts/dsynth-hooks/hook_* and scripts/dsynth-hooks/hook_common.sh into dsynth’s config base (/etc/dsynth/ or /usr/local/etc/dsynth/) and making them executable.
  2. Run dsynth normally.
  3. On a port failure, inspect ${Directory_logs}/evidence/runs/.../ports/.../ for the evidence bundle.

Why this matters for automated fixing

Reliable, size-capped evidence capture is the prerequisite for an automated port-fixing system:

  • the triage agent needs consistent inputs (errors.txt + port metadata)
  • the patch agent can generate DeltaPorts-style diffs based on evidence, not guesses
  • the rebuild loop stays dsynth-driven, and automation can be layered on without destabilizing build infrastructure

tuxillo added 30 commits May 15, 2026 00:59
Add dsynth hook scripts that snapshot distilled build errors and relevant port metadata on failures, grouped by run, so debugging can stay build-driven without keeping full workdirs.

Document the bounded evidence contract and the planned opencode integration/central queue model for asynchronous triage.
Add observe-only state server for remote UI integration:
- REST API for runs, jobs, bundles, ports, artifacts
- SSE event stream with replay support
- SQLite persistence for full history
- Filesystem reconciler for live updates

Validated on DragonFlyBSD VM - all endpoints tested.
- Add vanilla JS Bootstrap 5 UI served by state-server
- Live SSE event stream with replay/reconnect
- Views: Overview, Events, Jobs, Runs, Ports, Bundles
- Artifact viewer for markdown, diffs, logs
- SSE improvements: after_id, tail query params, ts in payloads
- Add /bundles API endpoint listing recent bundles
- Add #/bundles route with renderBundles() view
- Add Bundles nav item to navbar
- Update Phase 9 docs with completion status and new route
- agent-queue-runner: add apply job type and iteration tracking
- apply-patch: add DragonFly local mode, --no-push flag, BSD-compatible patch
- hook_common.sh: detect rebuild iterations, track previous bundles
- Add KEDB entry for DragonFly source patch conventions
Makefiles use tabs, not spaces. The agent was generating patches with
spaces which caused patch application failures. Added rule #8 to
emphasize preserving exact whitespace from the bundle context.
When retrying a patch application, the branch may already exist from
a previous failed attempt. Delete it first to allow the retry.
Stop extraction when hitting common section markers like 'Rationale',
'Files Modified', etc. Also detect when prose text starts after hunks.
This prevents non-diff content from being included in patch.diff.
The agent was generating patches with incorrect hunk line counts.
Added detailed instructions on unified diff format with example.
- Change dports-patch prompt to request complete file contents
- Add extract_files_from_response() to parse FILE content blocks
- Add generate_unified_diff() to create diffs programmatically
- Add generate_combined_diff() for multi-file patches
- Update write_patch_outputs() to try new format first, fallback to legacy

This fixes the malformed diff issue - LLMs are good at generating
file content but struggle with unified diff syntax and line counts.
The agent was outputting diff syntax inside FILE blocks for Makefile.DragonFly.
Make it explicit that Makefile.DragonFly should be raw makefile content,
while dragonfly/patch-* files are actual diffs.

Also add specific hint for the IFM_IEEE80211_VHT5G error.
…er UI

- Add activity_log and runner_status tables to state-server schema
- Add /activity and /runner-status API endpoints with SSE events
- Update agent-queue-runner to log activities at all job stages
- Add heartbeat thread for runner liveness detection (5s interval)
- UI: Add Activity Log panel showing last 10 runner activities
- UI: Add Runner Status indicator with staleness detection (>15s)
- UI: Add back button for artifact navigation in bundle view
- UI: Hide session_id.txt files from artifact lists
…b error display

- state-server: Only emit runner_status SSE events when status/job_id/stage
  changes, not on every heartbeat update_at change
- app.js: Don't trigger full re-render for runner_status/activity events
  (fixes bundle tab reset issue), only re-render on overview page
- app.js: Add renderJobDetail() with prominent error display and related
  activity log entries for failed jobs
- agent-queue-runner: Write .job.error files before moving failed jobs,
  move error files along with job files
tuxillo and others added 30 commits May 17, 2026 21:21
… + budget to phase 3

Phase 2 collapses to retiring apply-patch entirely after confirming its
responsibilities already live elsewhere (iterative loop -> Phase 3 harness,
PR creation -> existing process_pr_job). Phase 3 gains a trust-tier policy
(AUTO/ASSIST/MANUAL) sourced from config/agentic-policy.json plus
budget-bounded intra-job iteration (max iterations + max tokens via
litellm response.usage). PR/push is intentionally out of scope of the
iterative loop; the loop ends at a local rebuild_proof.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…e note

process_apply_job in agent-queue-runner was an 11-line stub never dispatched
by the runner's job-type table; deletion is cosmetic. AGENTIC_BUILDS.md
phase note drops the apply-patch reference and clarifies PR creation is
out of scope of the iterative patch loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phases 1 and 2 are shipped. Phases 4 and 5 are deferred. The plan
document is now a focused Phase 3 implementation plan: opencode +
TS plugin retire in favor of a Python harness (litellm-based) under
dportsv3.agent, with tools dispatching in-process to a refactored
agentic-worker module. Trust-tier + budget policy from
config/agentic-policy.json drives auto-iteration. Snippet rounds
fold into the triage call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New Python package under scripts/generator/dportsv3/agent/ that will
host the litellm-based replacement for opencode. This commit lands the
scaffolding only — nothing wires it into agent-queue-runner yet.

Modules:
- llm.py: litellm wrapper with normalized Response (text, tool_calls, usage)
- prompts.py: TRIAGE_SYSTEM (verbatim from config/opencode/agent/dports-triage.md)
- policy.py: load_policy / tier_for, applying confidence_floor downgrades
- snippets.py: subprocess wrapper around scripts/snippet-extractor
- triage.py: single-LLM-call flow with snippet rounds folded in-process

Plus:
- config/agentic-policy.json: AUTO/ASSIST/MANUAL tiers + classification map
- pyproject.toml: new optional-dependency 'agent = ["litellm"]'

litellm's only Rust-built transitive dep is pydantic-core, satisfied by
the generator venv's --system-site-packages reading py311-pydantic-core
from pkg.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…hing/PR

The agentic-worker's workspace concept (/build/synth/agentic-workspace,
workspace.json, separate FPORTS/DPorts) overlaps with dev-env; pick
dev-env as the single isolation primitive. The patch agent's tool
surface is reimplemented on top of dev-env exec + writable overlay
operations.

Also retract phase 2's "keep process_pr_job for manual use" — the
loop is purely local. Branches, commits, push, gh pr create all die.
agentic-worker the standalone script also dies (596 LOC); functions
live in dportsv3.agent.worker.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ubcommands, rebuild_proof schema

- Concrete edits table now covers the previously missed cruft in
  agent-queue-runner: VM_SSH_* constants/env/dispatch (lines 17-21,
  71-73, 721-761), DEFAULT_WORKSPACE_CONFIG + workspace.json loader
  (76, 167).
- Step 2 gains two prerequisite dev-env subcommands:
  'dportsv3 dev-env status NAME' (JSON readiness) and
  'dportsv3 dev-env path NAME [--writable]' (~25 LOC total). The
  worker uses them as the only interface to dev-env state, no
  re-parsing of dev-env's internals.
- Step 4 pins the new rebuild_proof.json schema: origin, rebuild_ok,
  dsynth_profile, build_command, timestamp_utc. No branch/head/fports
  fields.
- Step 6 retitled to cover opencode + VM_SSH + workspace cruft in one
  cleanup pass; negative check grep extended accordingly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When DP_HARNESS_TRIAGE_MODEL is set in the environment, route triage
through the in-process litellm harness instead of opencode. Snippet
rounds fold into the harness call; the runner no longer re-enqueues
for them. When DP_HARNESS_TRIAGE_MODEL is unset, the existing
opencode path runs unchanged.

- sys.path bootstrap so the standalone runner can import dportsv3.agent
  from scripts/generator/.
- New _process_triage_job_harness helper carrying the harness path:
  build payload (unchanged), call dportsv3.agent.triage.run, write
  triage.json audit (classification, confidence, snippet_rounds,
  tokens, model, via), consult needs_user_context / should_enqueue_patch
  the same way the opencode path does.
- New _write_triage_audit_harness writer for the new triage.json shape.

The new triage.md on disk is written by the harness itself as it runs
(needed so snippet-extractor can read the requests for the next round).

Env vars used: DP_HARNESS_TRIAGE_MODEL (required to take this path),
DP_HARNESS_TRIAGE_API_BASE, DP_HARNESS_TRIAGE_API_KEY,
DP_HARNESS_TIMEOUT (default 120), DP_HARNESS_MAX_SNIPPET_ROUNDS
(default 5).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two read-only subcommands the agent harness will use to query env
state and resolve host-side paths:

  dportsv3 dev-env status NAME
    Prints a single JSON line with name, target, origin, status,
    backend, oracle_profile, root_mounted, env_dir.

  dportsv3 dev-env path NAME [--writable]
    Prints env_dir (default) or env_dir/writable (with --writable).

Backed by the existing EnvironmentStore.{load,env_dir,writable_dir}
and mounts.mounts_under. No new dev-env behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New module dportsv3.agent.worker implementing the harness's tool
surface on top of dev-env primitives. Step 2b lands the host-side
functions (no chroot execution yet; chroot ops land in 2c).

Path resolution:
- EnvPaths dataclass + env_paths(env) — shells out to
  'dportsv3 dev-env path NAME [--writable]' (cache once per job).
- _resolve_chroot_path(paths, '/work/...') → host-side Path under
  env_dir/writable. Rejects paths outside /work/ and any .. escape
  via Path.relative_to.

Tool functions:
- env_verify(env) — wraps 'dportsv3 dev-env status NAME'; raises
  unless status=='ready' and root_mounted.
- get_file(env, path) — base64-encoded read; returns sha256 + size.
- put_file(env, path, content, encoding='text'|'base64',
  expected_sha256=None) — write with optimistic-lock check;
  preserves file mode on existing files.
- emit_diff(env, origin, relpath) — git diff against HEAD in the
  env's DeltaPorts overlay; never commits or stages.
- grep(env, pattern, path, include=None, max_bytes=8192) — rg over
  the writable overlay, output capped.

Verified with a temp-filesystem + git-init test harness: get_file
roundtrip, put_file text/base64/lock-match/lock-mismatch, emit_diff
finding modifications, grep finding pattern matches.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…unted

Two issues surfaced testing on dfly:

1. subprocess.run(['dportsv3', ...]) failed with FileNotFoundError
   because the generator venv's bin/ is not on PATH. Fix: invoke as
   [sys.executable, '-m', 'dportsv3', ...] so it uses the current
   interpreter's installed dportsv3 package regardless of PATH and
   skips the wrapper script's bootstrap. Override-able via DPORTSV3_CMD.

2. env_verify raised on root_mounted=false for envs in 'ready' state
   that hadn't been shelled into. That's a legitimate state — host-
   side tool ops operate on the writable overlay directly and
   'dev-env exec' auto-mounts on demand. Drop the root_mounted gate.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the six chroot-bound functions to dportsv3.agent.worker. All
shell out via 'dportsv3 dev-env exec ENV -- CMD'; dev-env auto-mounts
the env root on demand.

- materialize_dports(env, origin): 'reapply ORIGIN' (existing dev-env
  helper wrapping dportsv3 compose).
- extract(env, origin): 'make -C /work/DPorts/<origin> extract', then
  queries WRKDIR/WRKSRC via 'make -V' so the LLM can address files in
  the extracted source.
- dupe(env, path): 'dupe PATH' (in-chroot tool; clones source file
  with .orig backup so genpatch can later produce a unified diff).
- genpatch(env, path): 'genpatch PATH'; returns list of generated
  patch-* files from /work/genpatch-out/.
- install_patches(env, origin, patches=None): host-side shutil.copy2
  from <writable>/work/genpatch-out/ into
  <writable>/work/DeltaPorts/ports/<origin>/dragonfly/. No chroot exec
  needed since both source and destination are in the writable overlay.
- dsynth_build(env, origin): 'dbuild ORIGIN' (existing dev-env helper);
  returns rc + stdout/stderr + rebuild_ok=(rc==0).

Refinements per review:
- env_paths() now @lru_cache'd so repeated tool calls in one attempt
  pay the dev-env subprocess cost once.
- Module docstring documents the choice to drive dev-env via its CLI
  (stable contract) rather than importing EnvironmentStore (internals).

The tool surface is now complete for step 3 (tools.py registry).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…rtsv3

python -m dportsv3 skips the bash wrapper at the repo root that knows
how to dispatch the 'dev-env' subcommand to a separate venv. So
'python -m dportsv3 dev-env path 2026Q2' fails with "invalid choice:
dev-env" because the generator's argparse doesn't include it.

Resolution order for the wrapper:
  1. DPORTSV3_CMD env var override
  2. <repo>/dportsv3 sibling lookup (4 parents up from worker.py)
  3. shutil.which('dportsv3') on PATH

Lazily resolved on first use so import never fails when the wrapper
isn't reachable (e.g. unit tests outside the repo).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LLM-facing tool functions (materialize_dports, extract, dupe, genpatch,
dsynth_build, emit_diff) now return a uniform shape:

  {
    "ok": bool,                  # rc == 0
    "rc": int,
    "stdout_tail": str,          # last 32KB if longer
    "stderr_tail": str,
    "stdout_truncated": bool,
    "stderr_truncated": bool,
    ...tool-specific keys...,
  }

Tail-preservation matters for build errors (the useful diagnostics
live at the end of the log, not the start). The LLM inspects 'ok' +
the tails to decide what to do — no more opaque RuntimeError(2000
chars of mounting INFO logs) burying the real make/dsynth error.

Infrastructure helpers (env_paths, env_verify) still raise — those
are fatal setup errors that don't make sense to surface to the LLM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The dev-env chroot doesn't mount the dports tree at the conventional
/usr/dports path; /work/DPorts is the writable overlay where compose
materializes ports. Without PORTSDIR set, /usr/share/mk/bsd.port.mk
fails to open /usr/dports/Mk/bsd.port.mk and 'make extract' dies
before doing any work.

Fix: pass PORTSDIR=/work/DPorts on every 'make' invocation (both
the extract step and the WRKDIR/WRKSRC query). Define PORTSDIR as a
module constant in worker.py so future make-based tools share it.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…re writable

DragonFly's bsd.port.mk defaults WRKDIRPREFIX=/usr/obj/dports for
build artifacts, but /usr/obj is read-only in the dev-env chroot
(it's part of the base mount). 'make extract' progressed past the
PORTSDIR fix but then died with 'mkdir: /usr/obj/dports: Read-only
file system' while creating .extract_done markers.

Fix: point WRKDIRPREFIX at /work/obj (writable, under the env's
writable overlay). Also pass BATCH=yes so the ports config dialog
doesn't try to prompt for an unattached tty.

Factored the common overrides (PORTSDIR, WRKDIRPREFIX, BATCH) into
_make_vars() since future tools that call make will need the same.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 596-line standalone worker managed its own /build/synth/agentic-
workspace/ (DeltaPorts, FPORTS, DPorts, workspace.json) and was
invoked over SSH by config/opencode/tool/dports.ts. dportsv3.agent.worker
(landed in 2b/2c) replaces it on top of dev-env primitives — same
tool surface, no separate workspace concept, no SSH.

The TS plugin at config/opencode/tool/dports.ts still references the
old worker path; it's slated for deletion in step 6 (retire opencode).
Until then, the opencode-driven patch path is broken — which is fine
because the prior agentic stack had no working production users and
the harness patch flow lands in step 4.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per the plan: no PRs, no branches, no push. The loop is purely local.
process_pr_job's job (git push + gh pr create from rebuild_proof.json's
deltaports_branch + deltaports_head fields) doesn't fit the new model:
the harness never commits or branches in the env's writable overlay,
so those fields won't exist in the new rebuild_proof.json schema either.

Deleted:
- process_pr_job function body (~102 LOC)
- 'type == "pr"' dispatch arm in process_job
- dry-run handling for type=pr in process_job

Side effects to be cleaned in step 6 with the rest of the opencode/
workspace sweep:
- DEFAULT_WORKSPACE_CONFIG and load_workspace_config remain because
  build_triage_payload still embeds workspace.json in the LLM prompt.
  Dead in spirit; step 6 sweep removes them along with VM_SSH and
  opencode plumbing.

Enqueueing a type=pr job now hits the 'unknown job type' fallback,
which is correct behavior.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The awk-driven excision of process_pr_job in 316b4a1 produced
a non-executable file. chmod 755 to restore.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dportsv3.agent.tools — 11 hand-written OpenAI-format tool schemas
matching the worker function surface, plus a dispatch helper:
- env arg bound by caller (patch.run in step 4), not exposed to LLM
- inspect-based arg validation (reject unexpected args; flag missing
  required args)
- catch + surface worker exceptions as {ok: false, error: ...,
  traceback: ...} so the LLM can recover on the next turn
- workers already return {ok, ...} dicts; passthrough preserved

dportsv3.agent.tool_loop — multi-turn driver:
- call llm.complete with messages + tool schemas
- on tool_calls: dispatch each, append assistant+tool messages,
  re-call
- stop on text-only response or max_turns=20 safety cap
- returns (final_response, accumulated_usage)

The assistant message is rebuilt from our normalized Response
(role=assistant, content, tool_calls[]) rather than relying on
litellm's internal raw shape — provider-portable.

Verified with a stubbed LLM:
- 3-turn happy path: 2 tool calls then text-only; history shape
  [system, user, assistant, tool, assistant, tool]; usage summed
- bad tool name surfaced as {ok: false, error: "unknown tool: ..."}
- missing required arg surfaced as {ok: false, error: "missing ..."}
- worker FileNotFoundError surfaced as {ok: false, error: "...",
  traceback: "..."}

No new external deps.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DragonFly's py311-tokenizers tokenizers.abi3.so has missing
DT_NEEDED entries (libonig, esaxx) — loading it fails at import
time with chains of "Undefined symbol" errors that patchelf
doesn't fully resolve. litellm transitively imports tokenizers
for local cost calculation, but we don't need local token counting
(usage totals come from response.usage.total_tokens).

llm.py now tries to import tokenizers up-front. If that fails,
inject a no-op stub into sys.modules with Tokenizer/Encoding so
litellm's import chain succeeds. On platforms where tokenizers
works (Linux, macOS), the stub never runs.

The stub exposes only the surface litellm touches at import time
(Tokenizer.from_pretrained, .from_file, .encode); calls return
empty results. Cost calculation will be inaccurate or fall back to
heuristics, which is fine because we never invoke
litellm.cost_calculator.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
llm.py's tokenizers stub only fired when llm was imported — but
the runner / tools modules / manual inspections can hit litellm
without going through llm first. Moving the stub to
dportsv3/agent/__init__.py makes it run as soon as any module
under the package is imported.

This unblocks invocations like:
  python -c "import dportsv3.agent; import litellm; ..."

without needing to pre-import llm.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ting

When litellm's model-name → provider heuristic mis-routes (e.g., any
model name containing 'deepseek' or 'claude' is shunted to the native
provider client even when openai/ prefix and api_base are set),
custom_llm_provider forces a specific code path.

Generic passthrough; default None means "let litellm pick from prefix
as before." Set per flow:

- agent-queue-runner: DP_HARNESS_TRIAGE_PROVIDER env var
  (DP_HARNESS_PATCH_PROVIDER will follow in step 4 when patch wires)
- llm.complete(), tool_loop.run(), triage.run(): custom_llm_provider
  kwarg
- _manual_test_tool_loop: DP_TEST_PROVIDER env var

Native providers (anthropic/, deepseek/, nvidia_nim/, ...) work
unchanged because they don't set custom_llm_provider. The override is
only used when needed (most often: openai-compat third-party endpoints
with model names that fool the heuristic).

Also commits the manual test helper for tool_loop that was previously
left untracked. Useful while step 4 is in flight.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Thinking-mode providers (DeepSeek v4-pro/v4-flash directly or via
opencode.ai/zen, OpenAI o-series via some relays) emit a
reasoning_content field alongside content + tool_calls, holding the
model's intermediate chain-of-thought. The upstream API requires
this field to be passed back on the next request, or the multi-turn
call fails with HTTP 400:

  "The reasoning_content in the thinking mode must be passed back
  to the API."

Changes:
- llm.Response gains optional reasoning_content field; llm.complete
  extracts it from msg.reasoning_content if present (None otherwise).
- tool_loop._assistant_message_from includes reasoning_content in
  the reconstructed assistant message when set, so the next LLM
  request preserves continuity.

No-op for non-thinking models — reasoning_content stays None,
nothing extra is sent.

Verified with stubbed Response objects: thinking-mode reconstructed
message carries reasoning_content; non-thinking does not.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previously every get_file result was base64. For UTF-8 text files
(Makefiles, patches, source, the bulk of what the agent reads), this
inflated content by ~33% AND made the model mentally decode base64
to find anything inside — burning prompt AND completion tokens.

Now: read bytes, try UTF-8 decode with a NUL-byte sanity check;
return {encoding: 'text', content: <str>} on success, fall back to
{encoding: 'base64', content: <b64>} for binary. sha256 is computed
over the raw bytes, so put_file's expected_sha256 round-trip works
regardless of encoding.

Verified with a temp-fs harness: text Makefile returns text;
PNG-header file returns base64.

Schema description updated so the LLM understands the dual-mode
return shape. Example path in description updated to /work/DPorts/...
(the common path; agent reads materialized port files from DPorts,
edits source-of-truth in DeltaPorts).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The patch agent now runs end-to-end through the harness instead of opencode.

New code:
- prompts.PATCH_SYSTEM: 4kB system prompt spelling out the dev-env's
  three-tree layout (freebsd-ports / DeltaPorts / DPorts), tool
  vocabulary, the repair loop, discipline rules (no commits/push/PRs),
  and the mandatory output format ending in the new rebuild_proof.json
  schema (origin, rebuild_ok, dsynth_profile, build_command,
  timestamp_utc — no branch/head/fports fields).
- attempt_loop.run: budget-bounded retry around tool_loop. Each
  attempt is a fresh [system, user] conversation (with a small failure-
  context user turn appended on retries) so tool-call traces don't
  compound across attempts. Stops on rebuild_ok=true, budget exhaustion,
  or max_iterations. Returns PatchResult{status, final_text, usage,
  attempts[], proof}.
- patch.run: thin wrapper over attempt_loop.run.

Runner wiring (mirrors step 1 triage adapter):
- New env vars: DP_HARNESS_PATCH_{MODEL,API_BASE,API_KEY,PROVIDER,
  TIMEOUT}, DP_HARNESS_ENV (dev-env name default), DP_HARNESS_POLICY
  (optional override of config/agentic-policy.json path).
- process_patch_job: when DP_HARNESS_PATCH_MODEL is set, route to
  _process_patch_job_harness. It reads triage.md, resolves the tier
  via policy.tier_for(classification, confidence), and calls
  dportsv3.agent.patch.run with the tier's budget.
- Bundle outputs: analysis/patch.md (final LLM text), analysis/
  rebuild_proof.json (parsed proof block), analysis/patch_audit.json
  (status + tokens + per-attempt info + model), analysis/changes.diff
  (host-side git diff vs HEAD in the env's DeltaPorts overlay).

Verified attempt_loop against a stubbed tool_loop:
- success on first attempt
- failure then success (failure-context message added to retry)
- budget exhausted mid-sequence
- needs-help after all attempts fail
- missing rebuild_proof JSON falls back to needs-help

End-to-end against a real LLM + env requires a manual smoke run with
DP_HARNESS_PATCH_MODEL + a bundle on disk; covered in the next message.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
_manual_test_patch_flow.py fixtures a minimal bundle under /tmp
(meta.txt, errors.txt, analysis/triage.md) and invokes
dportsv3.agent.patch.run directly with a fabricated payload —
bypassing the queue runner so the harness's loop is exercised in
isolation against a real LLM + real dev-env.

The fixture intentionally doesn't simulate a broken port; it asks
the agent to verify the current state of the port via dsynth_build
and emit rebuild_proof.json accordingly. Pointing at devel/readline
(default) should reach rebuild_ok=true within 1-2 attempts.

Env vars mirror _manual_test_tool_loop (DP_TEST_MODEL, ENV, ORIGIN,
TIER_ITERATIONS, TIER_TOKENS, plus PROVIDER/API_BASE/API_KEY).

The bundle dir is preserved on exit so you can inspect the artifacts
the runner-side adapter would have written: patch.md, patch_audit.json,
rebuild_proof.json, changes.diff (note: those are written by
agent-queue-runner's _process_patch_job_harness, NOT by this fixture
— this fixture only calls patch.run and reports the PatchResult).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dsynth's 'build' subcommand asks interactive questions (most commonly
"Rebuild local repository? [Y/n]" before scanning, sometimes follow-
ups during the build). The agent has no tty, so the subprocess sat
in [ttyin] state and the patch flow hung — observed mid-test:

  load: 0.67  cmd: dsynth 31619 [ttyin] 0.00u 0.06s 0% 4128k

Fixes:
- worker._exec accepts optional input_text kwarg; default stdin is
  empty string (effectively /dev/null) so unexpected prompts fail
  fast rather than blocking.
- worker.dsynth_build pipes 'y\\n' * 50 to stdin to clear dsynth's
  prompts. Generous enough for multi-question build cycles, cheap
  to send.

dbuild (the dev-env helper) is unchanged — humans running it
interactively still get the prompts.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…_turns default

Observed: a single attempt burned 2,073,090 tokens before
attempt_loop's between-attempts budget check caught it. Root cause:
tool_loop only enforced max_turns (30), not the token budget. The
model went into a tool-call frenzy and attempt_loop only noticed
after 30 turns of accumulating 70k-token contexts.

Fixes:
- tool_loop.run: new max_tokens kwarg; checked at the top of each
  turn before issuing the LLM call. When the running total reaches
  the cap, return whatever Response we have. Default 0 = no cap
  (callers should pass remaining budget).
- attempt_loop.run: passes tier's remaining budget (max_tokens -
  tokens_used_so_far) as max_tokens to tool_loop on each attempt.
  Also short-circuits with status=budget-exhausted before kicking
  off a new attempt if the budget is already gone.
- tool_loop max_turns default: 20 -> 12. A patch task taking more
  than ~12 tool calls per attempt is in trouble; the cap should
  stop it sooner.
- attempt_loop max_tool_turns default: 30 -> 12.

Verified with stubbed LLM: tool_loop stops at 1200 tokens when
max_tokens=1200 (turn 3 was the first check after total>=cap).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When the patch fixture run produces a surprising token count, we
need to see what the model actually did — final_text alone tells us
nothing if the loop ended on a tool call.

_install_session_dump wraps llm.complete and tools.dispatch to write
each turn as a JSON line to <bundle>/session.jsonl:

- llm_call records: messages_preview (with long strings truncated
  to 800 chars), response.text (1200 chars), tool_calls,
  reasoning_content (600 chars), usage.
- tool_dispatch records: tool name, arguments, ok flag, stdout/stderr
  tails truncated to 600 chars. Excludes result body (file bytes,
  full schemas) to keep the trace compact and shareable.

After a run, share session.jsonl and the per-turn behavior is
visible without re-running.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant